Validation Issues induced by an Automatic Pre-Annotation Mechanism in the Building of Non-projective Dependency Treebanks
نویسندگان
چکیده
In order to build large dependency treebanks using the CDG Lab, a grammar-based dependency treebank development tool, an annotator usually has to fill a selection form before parsing. This step is usually necessary because, otherwise, the search space is too big for long sentences and the parser fails to produce at least one solution. With the information given by the annotator on the selection form the parser can produce one or several dependency structures and the annotator can proceed by adding positive or negative annotations on dependencies and launching iteratively the parser until the right dependency structure has been found. However, the selection form is sometimes difficult and long to fill because the annotator must have an idea of the result before parsing. The CDG Lab proposes to replace this form by an automatic pre-annotation mechanism. However, this model introduces some issues during the annotation phase that do not exist when the annotator uses a selection form. The article presents those issues and proposes some modifications of the CDG Lab in order to use effectively the automatic pre-annotation mechanism.
منابع مشابه
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
متن کاملLabel Pre-annotation for Building Non-projective Dependency Treebanks for French
The current interest in accurate dependency parsing make it necessary to build dependency treebanks for French containing both projective and non-projective dependencies. In order to alleviate the work of the annotator, we propose to automatically pre-annotate the sentences with the labels of the dependencies ending on the words. The selection of the dependency labels reduces the ambiguity of t...
متن کاملOn Detecting Errors in Dependency Treebanks
Dependency relations between words are increasingly recognized as an important level of linguistic representation that is close to the data and at the same time to the semantic functor-argument structure as a target of syntactic analysis and processing. Correspondingly, dependency structures play an important role in parser evaluation and for the training and evaluation of tools based on depend...
متن کاملBuilding gold-standard treebanks for Norwegian
Språkbanken at the National Library of Norway is currently building up gold-standard Dependency Grammar treebanks for Norwegian Bokmål and Nynorsk. The treebanks are manually annotated for morphological features, syntactic functions and dependency relations. This paper explains the choice of texts and format of the treebanks, some key aspects of the morphological and syntactic annotation, and i...
متن کاملAutomatic Error Detection in Annotated Corpora
Annotated corpus is a linguistic resource which explicitly encodes the information at syntactic and semantic levels for each sentence. Annotated corpora play a crucial role in many applications of natural language processing (NLP). Error free and consistent annotated corpora is vital for these applications. Creating annotated corpora is an expensive and time consuming process. Errors or anomali...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2014